[gfx906] Collection of fixes for MI50/MI60 (non-MFMA) GPUs#3593
Closed
dbsanfte wants to merge 4 commits into ROCm:develop
Conversation
Problem: `DeviceGemmDl` crashes on gfx906 when K >= 1472 with small M (M=1 decode case).

Root cause: `CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` was disabled by default. Without this, invalid buffer loads execute and crash before bounds checking can prevent them.

Solution:
1. Enable the OOB offset trick (`0x80000000`) so invalid coordinates safely return zero instead of accessing unmapped memory
2. Use the full `coordinate_has_valid_offset()` check instead of the `_assuming_visible_index_is_valid` variant for proper K bounds validation

Verified with INT8 GEMM tests: M=1 decode, K=14336, FFN projections.
Problem: When `FloatAcc` differs from `FloatC` (e.g., INT8×INT8→INT32 accumulator with FP32 output scaling), the CDE element op is invoked with the wrong storage types.

The element op contract is `(E& e, const C& c, const D& d...)`, where:
- `E` = `FloatC` (final output type, e.g., `float`)
- `C` = `FloatAcc` (accumulator type, e.g., `int32_t`)

The original code used `generate_tie()`, returning the same `c_thread_buf` for both `E&` and `C&`, which:
1. Violates the element op signature when the types differ
2. Causes compile errors with strictly-typed element ops
3. Results in undefined behavior during `ThreadwiseTensorSliceTransfer`

Solution: Introduce a separate `e_thread_buf<FloatC>` for the element op output, pass `(E& e)` from `e_thread_buf` and `(const C& c)` from `c_thread_buf`, then transfer `e_thread_buf` to global memory.

The bug has existed since the file was created in December 2022 (PR ROCm#517).
Pull request overview
This PR addresses critical bugs in ComposableKernel's DeviceGemmDl implementation for gfx906 (MI50/MI60) GPUs that lack MFMA instructions. The fixes resolve out-of-bounds memory access crashes and type safety issues affecting INT8 GEMM operations with mixed accumulator and output types.
Changes:
- Enable hardware-assisted OOB protection by default for buffer loads on gfx906
- Fix bounds checking logic to validate all coordinates, preventing invalid memory accesses
- Correct type mismatches in element operations when accumulator type differs from output type
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| include/ck/ck.hpp | Enables offset trick for OOB buffer load protection by default |
| include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v5r1.hpp | Strengthens bounds validation to check all coordinates including visible indices |
| include/ck/tensor_operation/gpu/grid/gridwise_gemm_dl_multiple_d.hpp | Introduces separate buffer for element op output with correct type (FloatC) to fix type mismatch when FloatAcc differs from FloatC |
- Add `CK_GFX906_DEBUG` macro for conditional debug output
- Log GEMM parameters (M, N, K, strides) for gfx906 devices
- Track which device GEMM variants are being invoked
- Helps diagnose launch bounds and occupancy issues on older GCN
Force-pushed cf801de to 5046c60
- Comment out always-on `std::cout` debug spam in device_gemm_multiple_d_dl.hpp
- Add optional `CK_DEBUG_KERNEL`-gated logging in gridwise_gemm_dl_v1r3.hpp
- Fixes console spam on every GEMM call for gfx906 devices
Force-pushed 5503add to 58b17c3
Imported to ROCm/rocm-libraries
Summary
This PR is an aggregation of fixes discovered while working with ComposableKernel on gfx906 (MI50/MI60) GPUs. These GPUs don't have MFMA instructions, so they rely on the `DeviceGemmDl` path, which has some edge cases that aren't well-tested.

Note: This is a draft PR that will be updated as we discover more issues.
Fix 1: Buffer Load OOB Crash with Large K and Small M
Problem
`DeviceGemmDl` crashes on gfx906 when K >= 1472 with small M (e.g., the M=1 decode case in LLM inference). The crash occurs in gridwise_gemm_dl_v1r3.hpp during `block_sync_lds()` after an invalid buffer load.

Root Cause
`CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` was disabled by default (set to 0).

Without the offset trick:
- invalid buffer loads execute and crash before bounds checking can prevent them

With the offset trick enabled:
- `0x80000000` is added to the offset, so invalid coordinates safely return zero instead of accessing unmapped memory

Solution
- include/ck/ck.hpp: Enable `CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` by default
- include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v5r1.hpp: Use `coordinate_has_valid_offset()` instead of `coordinate_has_valid_offset_assuming_visible_index_is_valid()` for full bounds validation

Verification
INT8 GEMM tests pass for:
- M=1 decode
- K=14336
- FFN projections
Fix 2: GridwiseGemmDlMultipleD Element Op Type Mismatch (FloatAcc != FloatC)
Problem
When `FloatAcc` differs from `FloatC` (e.g., INT8×INT8→INT32 accumulator with FP32 output scaling), the CDE element op is invoked with the wrong storage types.

The element op contract is `(E& e, const C& c, const D& d...)`, where:
- `E` = `FloatC` (final output type, e.g., `float`)
- `C` = `FloatAcc` (accumulator type, e.g., `int32_t`)

Root Cause
The original code at lines 615-618 used `generate_tie()`, returning the same `c_thread_buf` for both `E&` and `C&`. This causes:
- a violated element op signature when `FloatAcc != FloatC` (the element op expects `float&` for `e`, gets `int32_t&`)
- compile errors with strictly-typed element ops
- undefined behavior in `ThreadwiseTensorSliceTransfer`, which type-puns `FloatAcc` bits as `FloatC`

This bug has existed since the file was created in December 2022 (PR #517).
Solution
In include/ck/tensor_operation/gpu/grid/gridwise_gemm_dl_multiple_d.hpp:
- Introduce a separate `e_thread_buf<FloatC>` for element op output
- Pass `(E& e)` from `e_thread_buf` and `(const C& c)` from `c_thread_buf` using `tie()`
- Transfer `e_thread_buf` (not `c_thread_buf`) to global memory

Minimal Repro
See the original PR #3565 for a compile-time repro that demonstrates the type mismatch.
Environment